classification metric
Computational frame analysis revisited: On LLMs for studying news coverage
Kunjar, Sharaj, Smith, Alyssa Hasegawa, Mckenzie, Tyler R, Mohbe, Rushali, Scarpino, Samuel V, Welles, Brooke Foucault
Computational approaches have previously shown various promises and pitfalls when it comes to the reliable identification of media frames. Generative LLMs like GPT and Claude are increasingly being used as content analytical tools, but how effective are they for frame analysis? We address this question by systematically evaluating them against their computational predecessors: bag-of-words models and encoder-only transformers; and traditional manual coding procedures. Our analysis rests on a novel gold standard dataset that we inductively and iteratively developed through the study, investigating six months of news coverage of the US Mpox epidemic of 2022. While we discover some potential applications for generative LLMs, we demonstrate that they were consistently outperformed by manual coders, and in some instances, by smaller language models. Some form of human validation was always necessary to determine appropriate model choice. Additionally, by examining how the suitability of various approaches depended on the nature of different tasks that were part of our frame analytical workflow, we provide insights as to how researchers may leverage the complementarity of these approaches to use them in tandem. We conclude by endorsing a methodologically pluralistic approach and put forth a roadmap for computational frame analysis for researchers going forward.
ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature
Roy, Aritra, Grisan, Enrico, Buckeridge, John, Gattinoni, Chiara
Since the advent of various pre-trained large language models, extracting structured knowledge from scientific text has experienced a revolutionary change compared with traditional machine learning or natural language processing techniques. Despite these advances, accessible automated tools that allow users to construct, validate, and visualise datasets from scientific literature extraction remain scarce. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties, integrated with synthesis data from journal articles for comprehensive database creation. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models, to extract highly complex compositions associated with ceramic piezoelectric materials and corresponding piezoelectric strain coefficients (d33), motivated by the lack of a large dataset for such materials. DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82. This framework provides a simple, user-friendly, readily-usable package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.
Consistent Multilabel Classification
Sanmi Koyejo, Nagarajan Natarajan, Pradeep K. Ravikumar, Inderjit S. Dhillon
Multilabel classification is rapidly developing as an important aspect of modern predictive modeling, motivating study of its theoretical aspects. To this end, we propose a framework for constructing and analyzing multilabel classification metrics which reveals novel results on a parametric form for population optimal classifiers, and additional insight into the role of label correlations. In particular, we show that for multilabel metrics constructed as instance-, micro- and macro-averages, the population optimal classifier can be decomposed into binary classifiers based on the marginal instance-conditional distribution of each label, with a weak association between labels via the threshold. Thus, our analysis extends the state of the art from a few known multilabel classification metrics such as Hamming loss, to a general framework applicable to many of the classification metrics in common use. Based on the population-optimal classifier, we propose a computationally efficient and general-purpose plug-in classification algorithm, and prove its consistency with respect to the metric of interest. Empirical results on synthetic and benchmark datasets are supportive of our theoretical findings.
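The plug-in idea described in this abstract can be illustrated with a minimal sketch. It assumes that per-label marginal probability estimates are already available from some model, and that a single threshold shared across labels is tuned for micro-averaged F1 on held-out data. The function names, the candidate grid, and the choice of micro-F1 are illustrative assumptions, not the authors' implementation.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all instances and labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def plug_in_predict(probs, threshold):
    """Threshold each label's estimated marginal probability independently."""
    return [[1 if p > threshold else 0 for p in row] for row in probs]


def tune_threshold(probs, y_true, candidates):
    """Pick the shared threshold that maximizes micro-F1 on held-out data."""
    return max(candidates,
               key=lambda th: micro_f1(y_true, plug_in_predict(probs, th)))


# Hypothetical held-out probabilities (3 instances, 2 labels) and true labels.
probs = [[0.9, 0.2], [0.4, 0.7], [0.1, 0.25]]
y_true = [[1, 0], [1, 1], [0, 0]]
candidates = [i / 10 for i in range(1, 10)]
best = tune_threshold(probs, y_true, candidates)
```

The labels interact only through the shared threshold, which is one way to read the "weak association between labels via the threshold" mentioned above.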
Improving Model Classification by Optimizing the Training Dataset
Tukan, Morad, Mualem, Loay, Netzer, Eitan, Sigalat, Liran
In the era of data-centric AI, the ability to curate high-quality training data is as crucial as model design. Coresets offer a principled approach to data reduction, enabling efficient learning on large datasets through importance sampling. However, conventional sensitivity-based coreset construction often falls short in optimizing for classification performance metrics, e.g., $F_1$ score, focusing instead on loss approximation. In this work, we present a systematic framework for tuning the coreset generation process to enhance downstream classification quality. Our method introduces new tunable parameters, including deterministic sampling, class-wise allocation, and refinement via active sampling, that go beyond traditional sensitivity scores. Through extensive experiments on diverse datasets and classifiers, we demonstrate that tuned coresets can significantly outperform both vanilla coresets and full dataset training on key classification metrics, offering an effective path towards better and more efficient model training.
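As a rough illustration of sensitivity-based importance sampling with class-wise allocation, the sketch below splits a sampling budget across classes in proportion to class size, samples points within each class with probability proportional to a given positive score, and assigns inverse-probability weights. The scores, the proportional allocation rule, and the function name `classwise_coreset` are assumptions made for this example; the paper's actual tuning procedure is not reproduced here.

```python
import random
from collections import defaultdict


def classwise_coreset(scores, labels, budget, seed=0):
    """Return (index, weight) pairs sampled per class by importance.

    scores: positive per-point importance (sensitivity-like) values.
    labels: class label of each point.
    budget: approximate total number of samples to draw.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    n = len(labels)
    coreset = []
    for label, idxs in by_class.items():
        # Class-wise allocation: budget share proportional to class size.
        m = max(1, round(budget * len(idxs) / n))
        total = sum(scores[i] for i in idxs)
        probs = [scores[i] / total for i in idxs]
        # Importance sampling with replacement inside the class.
        picks = rng.choices(range(len(idxs)), weights=probs, k=m)
        for j in picks:
            # Weight 1 / (m * p) keeps weighted class sums unbiased.
            coreset.append((idxs[j], 1.0 / (m * probs[j])))
    return coreset


# Hypothetical data: 6 points of class "a", 2 of class "b", uniform scores.
labels = ["a"] * 6 + ["b"] * 2
scores = [1.0] * 8
sample = classwise_coreset(scores, labels, budget=4)
```

With uniform scores inside each class, the weights in each class sum to that class's size, which is a quick sanity check on the weighting.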
Performance Estimation in Binary Classification Using Calibrated Confidence
Kivimäki, Juhani, Białek, Jakub, Kuberski, Wojtek, Nurminen, Jukka K.
Model monitoring is a critical component of the machine learning lifecycle, safeguarding against undetected drops in the model's performance after deployment. Traditionally, performance monitoring has required access to ground truth labels, which are not always readily available. This can result in unacceptable latency or render performance monitoring altogether impossible. Recently, methods designed to estimate the accuracy of classifier models without access to labels have shown promising results. However, there are various other metrics that might be more suitable for assessing model performance in many cases. Until now, none of these important metrics has received similar interest from the scientific community. In this work, we address this gap by presenting CBPE, a novel method that can estimate any binary classification metric defined using the confusion matrix. In particular, we choose four metrics from this large family: accuracy, precision, recall, and F$_1$, to demonstrate our method. CBPE treats the elements of the confusion matrix as random variables and leverages calibrated confidence scores of the model to estimate their distributions. The desired metric is then also treated as a random variable, whose full probability distribution can be derived from the estimated confusion matrix. CBPE is shown to produce estimates that come with strong theoretical guarantees and valid confidence intervals.
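The core intuition, that calibrated confidence scores carry information about the confusion matrix even without labels, can be sketched as follows. This simplified example only computes expected confusion-matrix entries and plugs them into the metric formulas; the abstract describes deriving full probability distributions, which this sketch does not attempt. The function names and the 0.5 decision threshold are assumptions for illustration.

```python
def expected_confusion(confidences, threshold=0.5):
    """Expected TP, FP, TN, FN from calibrated P(y=1|x) scores, no labels.

    A prediction with calibrated confidence p is a true positive with
    probability p if predicted positive, and a false negative with
    probability p if predicted negative.
    """
    tp = fp = tn = fn = 0.0
    for p in confidences:
        if p >= threshold:   # predicted positive
            tp += p
            fp += 1.0 - p
        else:                # predicted negative
            fn += p
            tn += 1.0 - p
    return tp, fp, tn, fn


def point_estimates(confidences, threshold=0.5):
    """Plug expected counts into the usual metric formulas.

    Note: a ratio of expectations is only an approximation of the expected
    ratio, so these are rough point estimates, not full distributions.
    """
    tp, fp, tn, fn = expected_confusion(confidences, threshold)
    n = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / n if n else float("nan"),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "f1": 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else float("nan"),
    }


# Hypothetical calibrated scores for four unlabeled instances.
est = point_estimates([0.9, 0.8, 0.3, 0.1])
```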
Algorithmic Accountability in Small Data: Sample-Size-Induced Bias Within Classification Metrics
Briscoe, Jarren, Kepler, Garrett, Deford, Daryl, Gebremedhin, Assefaw
Evaluating machine learning models is crucial not only for determining their technical accuracy but also for assessing their potential societal implications. While the potential for low-sample-size bias in algorithms is well known, we demonstrate the significance of sample-size bias induced by combinatorics in classification metrics. This revelation challenges the efficacy of these metrics in assessing bias with high resolution, especially when comparing groups of disparate sizes, which frequently arise in social applications. We provide analyses of the bias that appears in several commonly applied metrics and propose a model-agnostic assessment and correction technique. Additionally, we analyze counts of undefined cases in metric calculations, which can lead to misleading evaluations if improperly handled. This work illuminates the previously unrecognized challenge of combinatorics and probability in standard evaluation practices and thereby advances approaches for performing fair and trustworthy classification methods.
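One way to see how sample size alone constrains a metric is to enumerate every confusion matrix for a group of size n and count how many distinct values the metric can take and how often it is undefined. The sketch below does this for precision. It is an illustrative assumption about what the combinatorial effect looks like, not the assessment or correction technique proposed in the paper; the function name `precision_resolution` is hypothetical.

```python
from fractions import Fraction


def precision_resolution(n):
    """Enumerate all confusion matrices with tp + fp + fn + tn == n.

    Returns (total matrices, matrices where precision is undefined,
    number of distinct achievable precision values).
    """
    total = 0
    undefined = 0
    values = set()
    for tp in range(n + 1):
        for fp in range(n + 1 - tp):
            for fn in range(n + 1 - tp - fp):
                # tn is fixed by the other three counts: n - tp - fp - fn
                total += 1
                if tp + fp == 0:
                    # No positive predictions: precision is 0/0.
                    undefined += 1
                else:
                    values.add(Fraction(tp, tp + fp))
    return total, undefined, len(values)


small_group = precision_resolution(2)
larger_group = precision_resolution(3)
```

Small groups can only realize a coarse set of metric values, so comparing them against large groups mixes real differences with differences in resolution.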
Conditional Electrocardiogram Generation Using Hierarchical Variational Autoencoders
Sviridov, Ivan, Egorov, Konstantin
Cardiovascular diseases (CVDs) are disorders impacting the heart and circulatory system. These disorders are the foremost and continuously escalating cause of mortality worldwide. One of the main tasks when working with CVDs is analyzing and identifying pathologies on a 12-lead electrocardiogram (ECG) with a standard 10-second duration. Using machine learning (ML) in automatic ECG analysis increases CVD diagnostics' availability, speed, and accuracy. However, the most significant difficulty in developing ML models is obtaining a sufficient training dataset. Due to the limitations of medical data usage, such as expensiveness, errors, the ambiguity of labels, imbalance of classes, and privacy issues, utilizing synthetic samples depending on specific pathologies bypasses these restrictions and improves algorithm quality. Existing solutions for the conditional generation of ECG signals are mainly built on Generative Adversarial Networks (GANs), and only a few papers consider the architectures based on Variational Autoencoders (VAEs), showing comparable results in recent works. This paper proposes the publicly available conditional Nouveau VAE model for ECG signal generation (cNVAE-ECG), which produces high-resolution ECGs with multiple pathologies. We provide an extensive comparison of the proposed model on various practical downstream tasks, including transfer learning scenarios showing an area under the receiver operating characteristic (AUROC) increase up to 2% surpassing GAN-like competitors.